Introduction to Neutrality Statistics and Signs of Selection


Homework for Lab 5: DUE Friday, November 9th


Readings:

In this lab, we’ll take our first look at neutrality statistics, which are statistical tests designed to assess whether SNPs appear to be undergoing selection within a population. We’ll go over what each of the statistics is, what it means, and how to interpret it, and then we’ll use them on our own data (with the exception of our final statistic, iHS, which requires whole-sequence data rather than variant calls).

First, let’s get an understanding of what each of these statistics might mean when calculated for our populations…

Tajima’s D

This is a fairly common neutrality statistic. For this lab, we’ve read the original paper by Tajima, which details the development of his D statistic as a test of the neutral mutation hypothesis (or Kimura’s theory of neutral mutation, as we’ve discussed it in class). Tajima’s D tests whether the pattern of mutations found in a particular region of the genome in a population follows the expectations of the neutral model; in other words, if there has been selection for or against a particular allele in the population, the pattern seen will violate the assumptions of the neutral model.

With Tajima’s D, the closer the D statistic is to 0, the more consistent the locus being tested is with the assumptions of neutrality. In other words, when D is close to 0, no class of alleles is over- or under-represented at that locus across the population, suggesting that selection is not acting on that locus. If the D statistic is negative, the population carries an excess of rare (low-frequency) variants at that locus, which is consistent with purifying (negative) selection, a recent selective sweep, or rapid population expansion. Conversely, if the D statistic is positive, the population carries an excess of intermediate-frequency variants, which is consistent with balancing selection or a population contraction.
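Schematically (this is the intuition, not PopGenome’s exact implementation), Tajima’s D is the standardized difference between two estimators of the population mutation parameter \(\Theta\): the average number of pairwise nucleotide differences (\(\hat{\theta}_\pi\)) and Watterson’s estimator based on the number of segregating sites \(S\) among \(n\) sequences:

\[
D = \frac{\hat{\theta}_\pi - \hat{\theta}_W}{\sqrt{\widehat{\mathrm{Var}}\left(\hat{\theta}_\pi - \hat{\theta}_W\right)}}, \qquad \hat{\theta}_W = \frac{S}{\sum_{i=1}^{n-1} 1/i}
\]

Under neutrality both estimators target the same \(\Theta\), so \(D \approx 0\); an excess of rare variants pulls \(\hat{\theta}_\pi\) below \(\hat{\theta}_W\) (negative \(D\)), while an excess of intermediate-frequency variants does the reverse (positive \(D\)).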

Today, we’ll calculate Tajima’s D values for our populations in order to determine if any directional selection is happening at the UCP1 locus.

Fu and Li’s D and F

Fu and Li’s D and F are also fairly straightforward tests derived from the same \(\Theta\) estimators as Tajima’s D. Like Tajima’s D, these statistics are grounded in Kimura’s neutral theory, again meaning that anything deviating strongly from 0 suggests a violation of neutrality. What these statistics do, specifically, is use the number of singleton or private mutations (i.e., mutations that appear on only a single chromosome in the sample). While the D statistic is based on the difference between the number of singletons and the total number of mutations in the sample, the F statistic is based on the difference between the number of singletons and the average number of nucleotide differences between pairs of sequences. This difference in how they’re derived makes it possible for one statistic to be positive while the other is negative in the same population.

We can interpret these statistics like so:

  • If the statistics are strongly negative, this shows an excess of singletons in the population: there are quite a few individuals carrying variants found in no one else. Such a situation could result from rapid population growth (new mutations arise on many lineages, and drift hasn’t had time to remove the rare ones) or from a selective sweep (see below for a formal definition) in the population’s history, in which strong directional selection drove a certain mutation, and the haplotype carrying it, toward fixation at the expense of other variants, after which new mutations accumulated as rare variants on the swept background.

  • If the statistics are positive, this indicates an excess of old/ancestral variants and a deficit of singletons. In other words, there are very few unique variants, and the variants that are present are carried by many individuals, a pattern consistent with balancing selection or a population contraction.
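In symbols (following Fu and Li’s 1993 definitions, up to the normalizing constants), both statistics compare the number of singleton mutations \(\eta_s\) against another summary of the data:

\[
D \propto \eta - a_n\,\eta_s, \qquad F \propto \hat{\theta}_\pi - \eta_s, \qquad a_n = \sum_{i=1}^{n-1}\frac{1}{i}
\]

where \(\eta\) is the total number of mutations, \(\hat{\theta}_\pi\) is the average number of pairwise differences, and \(n\) is the number of sequences. Under neutrality both differences have expectation zero, so an excess of singletons pushes both statistics negative, and a deficit of singletons pushes them positive.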

We will calculate Fu and Li’s D and F statistics for our data today.

Integrated Haplotype Score (iHS)

Recent positive selection usually happens by selective sweeps (i.e., the rapid spread of a particular haplotype, driven by selection for particular alleles within that haplotype), which act not only upon the SNP that corresponds to the phenotype under selection but upon the entire haplotype or linkage region in which that SNP is found. When this happens, the alleles in these selected haplotypes become almost entirely homozygous, because selection on this region of the genome is so strong. An iHS score, or Integrated Haplotype Score, is a test that can be used to measure the amount of recent positive selection experienced by an allele indirectly, by looking for evidence of selective sweeps. This test identifies extended haplotypes (large sets of SNPs on a single region of the chromosome) and measures how far homozygosity extends around each allele. Unusually extended homozygosity around one allele produces an extreme iHS score, while unremarkable haplotypes produce scores near 0.
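For reference, the score itself (as defined by Voight et al. 2006, the paper that introduced iHS) is the log-ratio of the integrated extended haplotype homozygosity (iHH) around the ancestral (A) versus derived (D) allele at a focal SNP, standardized among SNPs of similar derived-allele frequency:

\[
\mathrm{iHS} = \frac{\ln\left(\frac{iHH_A}{iHH_D}\right) - \mathbb{E}\left[\ln\left(\frac{iHH_A}{iHH_D}\right)\right]}{\mathrm{SD}\left[\ln\left(\frac{iHH_A}{iHH_D}\right)\right]}
\]

Standardizing within frequency bins makes scores comparable across SNPs, so an unusually long, homozygous haplotype around either allele shows up as a large \(|\mathrm{iHS}|\).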

Learning Outcomes

Part 1: Getting Started

Log in to the SCC and bring up the R Studio window as usual:

module load gcc
module load R
rstudio &

Next, we need to install the package PopGenome in our R Studio space, because the SCC doesn’t have it pre-installed.

#We need to install this package
install.packages("PopGenome")

Next, load in the packages we need to use:

library(vcfR)
library(PopGenome)
library(pegas)

Finally, we need to format our data for analysis. PopGenome is a bit funky about how it looks for files in your directory, so we need to move some files around in our folders to make the GENOME object that the PopGenome package works with. Specifically, if you haven’t already done so, we need to make a new directory within our working directory and put a copy of our data into that folder. Do this in your SCC space using the mkdir and cp commands (remember to use your population name rather than YRI!):

mkdir YRI
cp YRI.vcf YRI/YRI.vcf

Now, we can make our GENOME object by directing the function to the folder we just created. Additionally, we’ll make a DNAbin object (as we have done before) to save for later…
Here we’ll make our GENOME object:

YRIgenome <- readData("YRI", format = "VCF")
## |            :            |            :            | 100 %
## |====================================================
YRIgenome
## -----
## Modules:
## -----
##                 Calculation             Description         Get.the.Result
## 1                  readData            Reading data           get.sum.data
## 2          neutrality.stats        Neutrality tests         get.neutrality
## 3             linkage.stats  Linkage disequilibrium            get.linkage
## 4              recomb.stats           Recombination             get.recomb
## 5                F_ST.stats          Fixation index get.F_ST,get.diversity
## 6           diversity.stats             Diversities          get.diversity
## 7              sweeps.stats        Selective sweeps             get.sweeps
## 8                       MKT  McDonald-Kreitman test                get.MKT
## 9              detail.stats        Mixed statistics             get.detail
## 10                       MS   Coalescent simulation                      @
## 11           --------------             -----------          -------------
## 12          set.populations Defines the populations                       
## 13 sliding.window.transform          Sliding window                       
## 14           splitting.data         Splits the data                       
## 15               show.slots        ?provided slots?                       
## 16               get.status  Status of calculations

And here we’ll make our DNAbin object:

YRI <- read.vcfR("YRI/YRI.vcf")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 130
##   header_line: 131
##   variant count: 207
##   column count: 117
## Meta line 130 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 207
##   Character matrix gt cols: 117
##   skip: 0
##   nrows: 207
##   row_num: 0
## Processed variant: 207
## All variants processed
YRIdna <- vcfR2DNAbin(YRI)
## After extracting indels, 192 variants remain.
## Variant 3 processed
YRIdna
## 216 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 192 
## 
## Labels:
## NA18486_0
## NA18486_1
## NA18488_0
## NA18488_1
## NA18489_0
## NA18489_1
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.265 0.321 0.275 0.139 
## (Total: 41.47 kb)

Part 2: Fu and Li’s D and F

The function neutrality.stats from PopGenome conveniently calculates several different neutrality statistics for a given population. We’ll use it now to calculate Fu and Li’s D and F:

#Remember to replace the variable name with your own GENOME object!
neut <- neutrality.stats(YRIgenome)
## |            :            |            :            | 100 %
## |====================================================
get.neutrality(neut)
##       neutrality stats
## pop 1 Numeric,9
#To see results: 

neut@Fu.Li.F
##         pop 1
## [1,] 0.140358
neut@Fu.Li.D
##          pop 1
## [1,] 0.1403109

As we can see here, the YRI population has D and F statistics that are pretty close to 0. This suggests neither an excess of singleton mutations in this population nor an excess of ancestral mutations. Put simply, the UCP1 region in the Yoruba population from Ibadan shows no strong signal of recent selection, which we might expect from an African population with little exposure to extreme cold temperatures.

Before we move on, think about what the results for your population mean, based on how we interpret Fu and Li’s D and F. Is there an excess of singletons? Is there an excess of old/ancestral mutations? What does that tell you about your population’s history? We will come back to this at the end of class.

Part 3: Tajima’s D

Now we’ll look at Tajima’s D! When we ran the neutrality.stats function in PopGenome, it also calculated our Tajima’s D statistic. We can preview our Tajima’s D results from the PopGenome output…

neut@Tajima.D
##           pop 1
## [1,] 0.07245427

But we can get more useful information by using the tajima.test function in the pegas package. This function takes the DNAbin object we created earlier as input. It will not only give us the same D statistic as was calculated by the neutrality.stats function, which is all well and good, but will also give us a p-value associated with the test. This is important, because it gives us a sense of how significant the results of our neutrality test are.

tajima <- tajima.test(YRIdna)
tajima
## $D
## [1] 0.07245427
## 
## $Pval.normal
## [1] 0.9422404
## 
## $Pval.beta
## [1] 0.8974915

What we see from YRI is, again, what we might expect from a population that has been living in a warm climate for many generations: the UCP1 locus in this population does not appear to be subject to selection. This suggests that the only evolution (remember, that’s a change in allele frequencies over generations, and does not have to be due to selection) happening in the population is selectively neutral.

Interpret your own population’s Tajima’s D value. First, compare the D value given by this function with the one given by the neutrality.stats function. Are they the same? What is the D statistic for your population, and what does it mean in terms of whether, and what kind of, selection might be occurring in your population? We will revisit this question at the end of the lab.

Part 4: An Example of iHS… with Cows



Now, we will run our example of iHS. To run this example, I’m using the rehh package, which is specifically designed to look for Extended Haplotype Homozygosity, or evidence of selective sweeps. We can use this package to look at an example of calculating an iHS score. For a fun digression from human data, the genomic dataset included in the rehh package is from cows! You can use 1000 Genomes Project data with this package, but that requires you to map and phase the data with other genomic analysis-specific programs such as SHAPEIT and Beagle… that’s a bit more than we’re ready to do. Sooo… we’ll just use the cows. If you’d like to learn more about the cow genome, you can read about (and query) the cow genome here. You can also find the cow genome on Ensembl, but it’s unfortunately not yet annotated there (meaning we cannot yet easily make the connection between a chromosomal location and the genes it contains that are similar to those of humans). Even though our example population is a bit more bovine than usual, as we walk through this example, make some predictions about what an iHS score in your population might look like.

The first thing we’ll do is pull the example files from the package and create a haplohh object, which is what the rehh package reads for analyses:

library(rehh)
#puts example files in our working directory
make.example.files()
#read in files as a haplohh object
hap <- data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp", recode.allele = TRUE, chr.name = 12)
## Map file seems OK: 1424  SNPs declared for chromosome 12 
## Standard rehh input file assumed
## Alleles are being recoded according to map file as:
##  0 (missing data), 1 (ancestral allele) or 2 (derived allele)
## Discard Haplotype with less than  100 % of genotyped SNPs
## No haplotype discarded
## Discard SNPs genotyped on less than  100 % of haplotypes
## No SNP discarded
## Data consists of 280 haplotypes and 1424 SNPs

Once this is done, we can run the iHS function. The first line of code looks for the haplotypes within the data, and the second calculates the iHS score for every allele:

ehh <- scan_hh(hap,limhaplo=2,limehh=0.05,limehhs=0.05,maxgap=NA,threads=1)
head(ehh)
##          CHR POSITION    freq_A iHH_A iHH_D iES_Tang_et_al_2007
## F1200140  12    79823 0.1500000    NA    NA                  NA
## F1200150  12   125974 0.4071429    NA    NA                  NA
## F1200170  12   175087 0.3571429    NA    NA                  NA
## F1200180  12   219152 0.2214286    NA    NA                  NA
## F1200190  12   256896 0.1750000    NA    NA                  NA
## F1200210  12   316254 0.3892857    NA    NA                  NA
##          iES_Sabeti_et_al_2007
## F1200140                    NA
## F1200150                    NA
## F1200170                    NA
## F1200180                    NA
## F1200190                    NA
## F1200210                    NA
#this will give us a long list of iHS scores... why read the list when we can just plot it?!
ihs <- ihh2ihs(ehh,freqbin=0.025,minmaf=0.05)

Finally, to take in the entirety of our iHS scores, we’ll graph our results! The dotted line represents the threshold of significance for each iHS score:

ihsplot(ihs, plot.pval=TRUE, ylim.scan=2, pch=16, main="iHS")

What we’re seeing here are two plots representing a map of cow chromosome 12 (which is homologous to human chromosome 13), demarcated on the x-axis by the chromosomal positions of SNPs (the ticks on the x-axis are every 20 Mb, or megabases, i.e., 20-million-base increments; cow chromosome 12 is a little over 91 Mb long). The y-axis of the first plot is, more or less, the absolute value of the calculated iHS scores; the second shows the untransformed iHS values across the chromosome. The format of these plots is called a Manhattan plot, because the peaks and valleys of the data as you scan across the chromosome look a bit like the skyline of Manhattan.

You can see that there are several SNPs above/below the iHS significance threshold, particularly around 30 Mb along chromosome 12. This means that these SNPs exist in the population as part of massively homozygous blocks in this region of the cow genome, which suggests that a selective sweep happened in this population on a phenotype coded by an allele (or multiple alleles) located in this block around 30 Mb along chromosome 12! Cows have, in the past, apparently adapted to some selective pressure that has acted to maintain homozygosity across the population in this region of their genome. To discover what it is, we’d have to see what genes are in this homozygous haplotype block, but without annotation that’s a bit too difficult for us to manage in this course. If this were the human genome, we would simply go to Ensembl, query this position range, and see what gene regions are contained therein. These genes would then be candidate genes for understanding what selective force has been acting on which phenotype.

Now think about what an iHS score might look like in your population. Based on what we’ve discovered today about inferring selection in our populations, do you think that a selective sweep may have happened in the UCP1 region of your population? Is there anything that you know about the environment and your population’s potential adaptations to that environment that you can use to aid in your analysis?

Part 5: What Do Your Results Mean? Discuss with a partner from class:

As we wrap up this lab, think about the results you’ve generated. You have already been prompted to think about what your results might mean as we’ve gone along; now you have the opportunity to talk to a partner about what you think your results might mean.